8 research outputs found

    Automatic parallel implementations of adjoint codes for structured mesh applications

    Algorithmic Differentiation (AD) has been shown to be an essential tool for obtaining sensitivity information in multiple areas of science, such as Computational Fluid Dynamics (CFD) applications or finance. Yet there is no adequate tool to ease the cost of providing performance-portable AD codes, especially for modern hardware like GPU clusters. This paper sketches our plans and progress so far in extending the OPS framework with an adjoint tape (storage for descriptors of intermediate steps and intermediate states of variables) and shows preliminary performance results on CPU nodes. OPS (Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. Our work aims to exploit the benefits of OPS to provide performance-portable adjoint implementations for future structured mesh stencil applications using OPS with minimal modifications.
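
    To make the idea of an adjoint tape concrete, here is a minimal sketch of tape-based reverse-mode differentiation in plain Python. It is not the OPS interface: the names `Tape`, `Var` and `backward` are hypothetical and chosen only for illustration. Each operation records its output together with the local partial derivatives of its inputs, and the reverse sweep replays the records backwards to accumulate adjoints.

```python
class Tape:
    """Hypothetical tape: an ordered record of the operations performed."""

    def __init__(self):
        # Each entry is (output_var, [(input_var, d_output/d_input), ...]).
        self.entries = []


class Var:
    """A value tracked on a tape, with an adjoint filled in by backward()."""

    def __init__(self, value, tape):
        self.value = value
        self.adjoint = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.entries.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.entries.append(
            (out, [(self, other.value), (other, self.value)])
        )
        return out


def backward(tape, output):
    """Reverse sweep: propagate adjoints from the output back to the inputs."""
    output.adjoint = 1.0
    for out, inputs in reversed(tape.entries):
        for var, partial in inputs:
            var.adjoint += partial * out.adjoint


tape = Tape()
x = Var(3.0, tape)
y = Var(4.0, tape)
f = x * y + x          # f = x*y + x, so df/dx = y + 1 and df/dy = x
backward(tape, f)
```

    After the reverse sweep, `x.adjoint` and `y.adjoint` hold the sensitivities of `f` with respect to `x` and `y`. A production tool would additionally have to store or recompute intermediate states efficiently, which is where the storage cost mentioned in the abstract arises.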

    Bitwise Reproducible task execution on unstructured mesh applications

    Many mesh applications use floating-point arithmetic, which does not necessarily obey the associative laws of algebra. This can make an application's results unreproducible. In this paper we present a method for unstructured mesh applications that provides bitwise reproducibility between separate runs, even if they are started with different numbers of MPI processes. We implement our work in the OP2 domain-specific library, which provides an API that abstracts the solution of unstructured mesh computations. We carry out a performance analysis of our method applied to two applications: a simple airfoil application, and a more complex Aero application which uses a finite element method and a conjugate-gradient algorithm. We show a 2.37× to 1.49× slowdown on these applications as the price of full bitwise reproducibility.
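
    The reproducibility problem can be illustrated with a short Python sketch. It is not the method of the paper and does not use OP2 or MPI: `partitioned_sum` and `canonical_sum` are hypothetical helpers. The first mimics a reduction whose grouping depends on the number of processes; the second shows one generic remedy, which is to accumulate contributions in a fixed order keyed by a global element index. Explicit loops are used instead of the built-in `sum`, so that the additions happen in exactly the order written.

```python
def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, num_parts):
    """Sum contiguous chunks separately (as ranks would), then combine."""
    chunk = -(-len(values) // num_parts)  # ceiling division
    partials = [
        sequential_sum(values[i:i + chunk])
        for i in range(0, len(values), chunk)
    ]
    return sequential_sum(partials)


def canonical_sum(contributions):
    """Sum (global_id, value) pairs in global_id order, whatever order
    they arrived in."""
    total = 0.0
    for _, value in sorted(contributions):
        total += value
    return total


values = [1e16, 1.0, -1e16, 1.0]

one_part = partitioned_sum(values, 1)
two_parts = partitioned_sum(values, 2)
print(one_part == two_parts)  # the grouping changes the rounded result

arrival_a = list(enumerate(values))
arrival_b = list(reversed(arrival_a))
print(canonical_sum(arrival_a) == canonical_sum(arrival_b))
```

    The fixed ordering restores run-to-run agreement at the cost of extra bookkeeping, which is consistent in spirit with the slowdown the abstract reports for full bitwise reproducibility.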

    Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS

    The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations that improve data locality by scheduling across loops perform particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data-locality-improving optimisation called iteration space slicing for use in large OPS applications on both shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we carry out scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
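
    As a rough illustration of scheduling across loops, the Python sketch below fuses two dependent 1D three-point stencil loops tile by tile. It is a deliberately simplified variant that recomputes a one-point halo of the intermediate array for each tile; it is not the iteration space slicing algorithm of the paper and does not use the OPS API. The function names and the choice of stencil are assumptions made for the example.

```python
def run_untiled(a):
    """Two full sweeps: b is written in full before c is computed."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(1, n - 1):
        b[i] = a[i - 1] + a[i] + a[i + 1]
    for i in range(2, n - 2):
        c[i] = b[i - 1] + b[i] + b[i + 1]
    return c


def run_tiled(a, tile):
    """Execute both loops tile by tile so the needed part of b stays small."""
    n = len(a)
    c = [0.0] * n
    for start in range(2, n - 2, tile):
        end = min(start + tile, n - 2)
        # First loop, restricted to this tile plus a halo of one point
        # on each side (the halo points are recomputed by neighbouring tiles).
        b_local = {}
        for i in range(start - 1, end + 1):
            b_local[i] = a[i - 1] + a[i] + a[i + 1]
        # Second loop on the tile itself, reading the freshly computed b.
        for i in range(start, end):
            c[i] = b_local[i - 1] + b_local[i] + b_local[i + 1]
    return c


a = [float((i * i) % 7) for i in range(20)]
print(run_untiled(a) == run_tiled(a, 4))
```

    Both versions evaluate the same expressions in the same order at every point, so their results match exactly; the tiled version only changes when each point is computed. Knowing which elements each loop reads and writes is what makes such a reordering safe, which is the per-loop data access information the abstract refers to.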

    Heterogeneous CPU-GPU Execution of Stencil Applications

    An abstraction for local computations on structured meshes and its extension to handling multiple materials

    Computations involving a neighbourhood on structured meshes represent a wide class of applications that includes the simulation of cellular automata and the solution of partial differential equations (PDEs). In this paper we present an abstraction for describing such computations at a high level, allowing fast experimentation and productivity. The abstraction is designed such that it can be automatically converted to various high-performance implementations. A critical feature of this abstraction is an extension to support a varying number of materials, or species, at each grid point, enabling much more complex simulations.

    Beyond 16GB: Out-of-Core Stencil Computations

    Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most 15% loss in efficiency.